NOTE: The dataset, visualizations, and results in this presentation are not representative of any type of business, user, or review on Yelp.

1. Yelp Academic Dataset #1 (Small)


1.1. Initial Questions


  • Initial questions, posed before looking at the Yelp data
    • New Yorkers are more likely to give higher review scores than people in other states
    • Women tend to give higher review scores and/or write more reviews than men
    • New York has more restaurants with high review scores than other states




1.2. Simple exploratory analysis about the Yelp dataset:

  • 1 table
    • Business (24 variables, 474,434 observations, 289.4 MB)


  • Questions to answer with the data
    • Average review ratings by state
    • More detailed review ratings by state


  • Columns of interest
    • stars (1.0, 1.5, 2.0, ~ 4.5, 5.0)
    • review_count
    • state (16 states)
    • latitude/longitude



ER Diagram for Yelp Small Dataset




1.3. Average review ratings by state.


a. Pulling the data

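library(dplyr)

# Keep businesses with a known state, then group by state and star rating
dataGroupByStateStar <- ylpDataSmall %>% 
  filter(state != '') %>% mutate(tsum = n()) %>% 
  group_by(state, stars) 

# Per-state totals; avg_rating is the unweighted mean of the business star ratings
dataForTableByStateStar <- dataGroupByStateStar %>% group_by(state) %>%
  summarise(total_business = n(), total_reviews = sum(review_count), avg_rating = round(mean(stars), 2))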
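
The summary above averages star ratings per business, so a business with 5 reviews counts as much as one with 500. As a minimal sketch of the difference, the base-R example below computes the same per-state table on a small made-up data frame (`toy` and its values are hypothetical, not taken from the Yelp data) and adds a review-count-weighted mean (`weighted_rating`, also an illustrative name) next to the unweighted one.

```r
# Hypothetical toy data standing in for ylpDataSmall (values are made up).
toy <- data.frame(
  state        = c("NY", "NY", "AZ", "AZ", ""),
  stars        = c(4.0, 2.0, 5.0, 3.0, 4.5),
  review_count = c(10, 30, 5, 5, 7),
  stringsAsFactors = FALSE
)

# Drop rows with an empty state, mirroring the filter() step.
toy <- toy[toy$state != "", ]

# One row per state: business count, total reviews, unweighted and weighted means.
by_state <- do.call(rbind, lapply(split(toy, toy$state), function(d) {
  data.frame(
    state           = d$state[1],
    total_business  = nrow(d),
    total_reviews   = sum(d$review_count),
    avg_rating      = round(mean(d$stars), 2),
    weighted_rating = round(weighted.mean(d$stars, d$review_count), 2)
  )
}))
print(by_state)
```

In this toy data the two means agree for AZ but differ for NY, because the heavily reviewed NY business has the lower rating.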



b. Loading the data on the Leaflet map

library(leaflet)
# dataTotalAvgStarByState (per-state averages with city_lat / city_lng) and the
# myCol palette function are not shown in this presentation.
leaflet(dataTotalAvgStarByState) %>%
  addTiles() %>%
  setView(lng = -96.503906, lat = 38.68551, zoom = 4) %>%
  addCircles(lng = ~city_lng, lat = ~city_lat, weight = 0,
             radius = ~exp(totAvgRatingByState * 1.4) * 800, fillOpacity = 0.5,
             color = ~myCol(totAvgRatingByState),
             popup = ~as.character(totAvgRatingByState)) %>%
  addLegend("bottomleft", pal = myCol, values = ~sort(totAvgRatingByState),
            title = "Avg.Ratings", labFormat = labelFormat(prefix = ""), opacity = 0.5)
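
The circle radius in the map call is an exponential function of the average rating, which exaggerates small rating differences between states. The sketch below isolates that formula in base R; the helper name `circle_radius` and the sample ratings are illustrative assumptions, not part of the original analysis.

```r
# Radius (in metres, as addCircles() interprets it): exp(rating * 1.4) * 800.
circle_radius <- function(rating) exp(rating * 1.4) * 800

# Made-up average ratings half a star apart.
ratings <- c(3.0, 3.5, 4.0)
radii <- circle_radius(ratings)
print(round(radii))

# Each extra half star multiplies the radius by exp(0.7), i.e. roughly doubles it.
print(radii[2] / radii[1])
```

A linear mapping would make states with similar averages nearly indistinguishable on the map, which is presumably why an exponential scale was chosen.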



1.4. A grid of detailed average ratings by state.


a. Pulling the data

dataWeightedGroupByStateStar <- dataGroupByStateStar %>% 
  summarise(totalByStar = n()) %>% arrange(desc(stars)) %>% 
  mutate(total = sum(totalByStar)) %>% mutate(percent = round((totalByStar / total)*100, 1)) %>%
  mutate(percentWeight = ifelse(percent >= 20, percent * 2.5, # custom column to weight the percent for size on the plot
                                ifelse(percent < 20 & percent >= 15, percent * 1.2, 
                                       ifelse(percent < 15 & percent >= 10, percent,
                                              ifelse(percent < 10 & percent >= 5, percent * 0.8, 1)))))
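
The nested `ifelse()` above assigns each percentage to one of five tiers and scales it by a tier-specific multiplier (anything below 5% is set to a flat weight of 1). As a sketch of the same rule in a flatter form, the base-R helper below uses `findInterval()`; the function name `weight_percent` and the sample percentages are hypothetical.

```r
# Same tiers as the nested ifelse():
#   >= 20 -> x2.5, [15, 20) -> x1.2, [10, 15) -> x1, [5, 10) -> x0.8, < 5 -> 1
weight_percent <- function(percent) {
  multipliers <- c(0.8, 1, 1.2, 2.5)
  # tier is 0 below 5, then 1..4 for the four scaled tiers
  tier <- findInterval(percent, c(5, 10, 15, 20))
  ifelse(tier == 0, 1, percent * multipliers[pmax(tier, 1)])
}

# Made-up percentages covering every tier, including the boundaries.
percent <- c(2, 5, 12.5, 15, 20, 40)
print(weight_percent(percent))
```

Keeping the breakpoints and multipliers in two vectors makes it easier to retune the bubble sizes without editing nested conditions.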


b. Loading the data on the ggplot bubble plot

library(ggplot2)
ggplot(dataWeightedGroupByStateStar, aes(x = state, y = stars, label = percent)) +
  # alpha sits inside aes(), so it is mapped as a constant rather than fixed at 0.05
  geom_point(aes(size = percentWeight * 2, colour = stars, alpha = 0.05)) +
  geom_text(hjust = 0.4, size = 4) +
  scale_size(range = c(1, 30), guide = "none") +
  scale_color_gradient(low = "darkblue", high = "red") +
  labs(title = "A grid of detailed avg.ratings by state", x = "State", y = "Detailed Avg.Ratings") +
  scale_y_continuous(breaks = seq(1, 5, 0.5)) +
  theme(legend.title = element_blank())



—

2. Yelp Academic Dataset #2 (Big)


2.1. Initial questions


  • Initial questions
    • Users with more followers at Yelp -> purchase more (or check in more frequently)
    • Female users -> more frequent reviews
    • Restaurants in the higher price range -> higher review ratings?
  • Source link is currently missing.



2.2. Simple exploratory analysis:


  • 5 tables
    • Business (98 variables, 77,445 observations, 30.4MB)
    • User (23 variables, 552,339 observations, 135.9MB)
    • Reviews (10 variables, 2,225,213 observations, 1.64GB)
    • Checkin (170 variables, 55,569 observations, 13.2MB)
    • Tip (6 variables, 591,864 observations, 74.5MB)


  • My questions with the data
    • Having more followers at Yelp -> Rate more frequently?
    • Having more followers at Yelp -> Rate higher?


ER Diagram for Yelp Big Dataset

  • Columns of interest
    • average_stars (1.0, 1.5, 2.0, ~ 4.5, 5.0)
    • elite
    • fans



2.3. (Q1) Having more followers at Yelp -> Rate more frequently?


a. Pulling the data

# Split users on the elite field: "[]" means the user has no elite years listed
ylpUserSmElite <- ylpUserSm3 %>% filter(elite != "[]")
ylpUserSmNormal <- ylpUserSm3 %>% filter(elite == "[]")
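
The box plots in this section group users by a `Fan_Size` column whose construction is not shown. One plausible way to derive such a column is to bin `fans` with `cut()`; the sketch below does that on made-up users and then applies the same elite / non-elite split. The breakpoints, labels, and the `users` data frame are assumptions for illustration only.

```r
# Hypothetical users (values are made up, not from the Yelp data).
users <- data.frame(
  fans  = c(0, 3, 12, 150),
  elite = c("[]", "[2014]", "[]", "[2012, 2013]"),
  stringsAsFactors = FALSE
)

# Assumed bins; the real breakpoints behind Fan_Size are not given in the source.
users$Fan_Size <- cut(users$fans,
                      breaks = c(-Inf, 0, 10, 100, Inf),
                      labels = c("0", "1-10", "11-100", "101+"))

# Same split as above, written in base R.
elite_users  <- users[users$elite != "[]", ]
normal_users <- users[users$elite == "[]", ]

print(table(users$Fan_Size))
print(nrow(elite_users))
```

Binning before plotting keeps the number of boxes small; plotting one box per distinct `fans` value would be unreadable for a long-tailed count like this.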



b. Loading the data on the box plot

library(ggthemes)
# Yelp users in the boxplot
qplot(fans, review_count, data = ylpUserSm3, geom = "boxplot", group = Fan_Size, 
    color = Fan_Size) + labs(title = "Total review counts by the number of fans") + 
    theme(legend.position = "none")



# Elite Yelp group users in the boxplot
qplot(fans, review_count, data = ylpUserSmElite, geom = "boxplot", group = Fan_Size, 
    color = Fan_Size) + labs(title = "Total review counts by the number of fans (Elite users)") + 
    theme(legend.position = "none")



# Non-elite Yelp group users in the boxplot
qplot(fans, review_count, data = ylpUserSmNormal, geom = "boxplot", group = Fan_Size, 
    color = Fan_Size) + labs(title = "Total review counts by the number of fans (Non-elite Users)") + 
    theme(legend.position = "none")



c. Loading the data on the combination plots (point+smooth)

# Yelp users in combination plots
qplot(fans, review_count, data = ylpUserSm1, geom = c("point", "smooth"), colour = fans) + 
    labs(title = "Total review counts by the number of fans") + scale_color_gradient(low = "darkblue", 
    high = "darkred") + stat_smooth(fill = "green", colour = "cyan", size = 1, 
    alpha = 0.1)



# Elite Yelp group users in combination plots
qplot(fans, review_count, data = ylpUserSmElite, geom = c("point", "smooth"), 
    colour = fans) + labs(title = "Total review counts by the number of fans (Elite users)") + 
    scale_color_gradient(low = "darkblue", high = "darkred") + stat_smooth(fill = "green", 
    colour = "cyan", size = 1, alpha = 0.1)



# Non-elite Yelp group users in combination plots
qplot(fans, review_count, data = ylpUserSmNormal, geom = c("point", "smooth"), 
    colour = fans) + labs(title = "Total review counts by the number of fans (Non-elite users)") + 
    scale_color_gradient(low = "darkblue", high = "darkred") + stat_smooth(fill = "green", 
    colour = "cyan", size = 1, alpha = 0.1)
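
The point and smooth plots show the trend visually. As a complementary, hedged sketch, a rank correlation gives a single number for how consistently review counts rise with fan counts. The example uses made-up values (not the Yelp data), and Spearman's method is chosen here as an assumption because both columns are heavily skewed counts.

```r
# Made-up fan and review counts for five hypothetical users.
fans         <- c(0, 1, 2, 5, 10)
review_count <- c(3, 8, 20, 15, 60)

# Spearman's rho compares ranks, so a few very popular users do not dominate.
rho <- cor(fans, review_count, method = "spearman")
print(rho)
```

The same call could be run separately on the elite and non-elite subsets to compare the strength of the relationship in each group.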



2.4. (Q2) Having more followers at Yelp -> Rate higher?


a. Loading the data on the box plot

# Yelp users in the boxplot
qplot(fans, average_stars, data = ylpUserSm3, geom = "boxplot", group = Fan_Size, 
    color = Fan_Size) + labs(title = "Average ratings by the number of fans") + 
    theme(legend.position = "none")



# Elite Yelp group users in the boxplot
qplot(fans, average_stars, data = ylpUserSmElite, geom = "boxplot", group = Fan_Size, 
    color = Fan_Size) + labs(title = "Average ratings by the number of fans (Elite users)") + 
    theme(legend.position = "none")



# Non-elite Yelp group users in the boxplot
qplot(fans, average_stars, data = ylpUserSmNormal, geom = "boxplot", group = Fan_Size, 
    color = Fan_Size) + labs(title = "Average ratings by the number of fans (Non-elite users)") + 
    theme(legend.position = "none")



b. Loading the data on the combination plots (point+smooth)

# Yelp users in combination plots
qplot(fans, average_stars, data = ylpUserSm1, geom = c("point", "smooth"), colour = fans) + 
    labs(title = "Average ratings by the number of fans") + scale_color_gradient(low = "darkblue", 
    high = "darkred") + stat_smooth(fill = "green", colour = "cyan", size = 1, 
    alpha = 0.1)



# Elite Yelp group users in combination plots
qplot(fans, average_stars, data = ylpUserSmElite, geom = c("point", "smooth"), 
    colour = fans) + labs(title = "Average ratings by the number of fans (Elite users)") + 
    scale_color_gradient(low = "darkblue", high = "red") + stat_smooth(fill = "green", 
    colour = "cyan", size = 1, alpha = 0.1)



# Non-elite Yelp group users in combination plots
qplot(fans, average_stars, data = ylpUserSmNormal, geom = c("point", "smooth"), 
    colour = fans) + labs(title = "Average ratings by the number of fans (Non-elite users)") + 
    scale_color_gradient(low = "darkblue", high = "red") + stat_smooth(fill = "green", 
    colour = "cyan", size = 1, alpha = 0.1)